Data transfer between storage systems using data fingerprints

ABSTRACT

A system and method for data replication is described. A destination storage system receives a message from a source storage system as part of a replication process. The message includes an identity of a first file, information about where the first file is stored in the source storage system, a name of a first data being used by the first file and stored at a first location of the source storage system, and a fingerprint of the first data. The destination storage system determines that a mapping database is unavailable or inaccurate, and accesses a fingerprint database using the fingerprint of the first data received with the message to determine whether data stored in the destination storage system has a fingerprint identical to the fingerprint of the first data.

RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. application Ser. No. 14/195,509, filed on Mar. 3, 2014, now allowed, titled “DATA TRANSFER BETWEEN STORAGE SYSTEMS USING DATA FINGERPRINTS,” which is incorporated herein by reference.

BACKGROUND

A source storage system can perform a data replication process to cause data to be transferred from the source storage system to a destination storage system. The destination storage system can maintain a database that maps the data between the source storage system and the destination storage system for subsequent data replication processes. In some instances, however, the database can become unavailable or inaccurate when an operation takes place on either the source storage system or the destination storage system that alters the mapping of the data between the storage systems. In such a case, subsequent data replication processes can become inefficient because the previously transferred data is not detected as having already been received by the destination storage system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example system to perform data replication using data fingerprints.

FIG. 2 illustrates an example method for performing data replication using data fingerprints.

FIGS. 3A through 3C illustrate example databases used by a destination storage system.

FIG. 4 is a block diagram that illustrates a computer system upon which examples described herein may be implemented.

DETAILED DESCRIPTION

Examples described herein provide for a replication system that can use data fingerprints to preserve data replication efficiency in situations where a mapping database between two storage systems is unavailable or inaccurate. A mapping database can operate as a translation table that maps data stored at source-side data storage locations to replicated data stored at destination-side data storage locations. Subsequent replication processes can use the mapping database to prevent duplication of data. When the mapping database is unavailable or inaccurate, the destination storage system can use data fingerprints to identify similar blocks of data and then verify whether the blocks of data are identical without transferring the data between systems.

In one example, a system, such as a destination storage system, can receive a replication message as part of a data replication process from a source storage system. As used herein, a “source storage system” can refer to a storage system that is a source of a data replication process, and a “destination storage system” can refer to a storage system that is a destination or target of the data replication process, to which data from the source storage system is to be transferred or copied. The message can include (i) an identity of a first file, (ii) information about where the first file is stored in the source storage system, (iii) a name of a first data being used by the first file and stored at a first location of the source storage system, and (iv) a fingerprint of the first data. In response to receiving the replication message from the source storage system, the destination storage system can determine whether its mapping database is unavailable or inaccurate (or corrupt).

The mapping database can be determined to be unavailable or inaccurate if an operation occurred that altered the mapping of the data between the source and destination storage systems (e.g., as a result of local names being changed on the source storage system and/or the destination storage system). For example, the mapping database may no longer be useful if either the source-side or destination-side file system is moved from a previous location to a new location. In response to determining that the mapping database is unavailable or inaccurate, the destination storage system can access a fingerprint database using the fingerprint of the first data received with the replication message to determine whether data stored in the destination storage system has a fingerprint identical to the fingerprint of the first data. The destination storage system can maintain the fingerprint database that stores a plurality of entries of fingerprints. Each entry can correspond to data stored in the destination storage system and can include (i) a respective fingerprint for that data, (ii) an identity of a file that uses that data, and (iii) respective information about where that file is stored in the source storage system.

According to some examples, if a second data stored in the destination storage system has a fingerprint identical to the fingerprint of the first data, the destination storage system can determine that the first data and the second data are at least similar, and can transmit to the source storage system a request message to confirm whether the first data is identical to the second data that is already stored in the destination storage system. The request message can include (i) an identity of a second file that uses the second data and (ii) respective information about where the second file is stored in the source storage system. In this manner, the destination storage system can ask the source storage system to check whether the first data is identical to the second data without having to transfer the second data to the source storage system.

If the destination storage system receives from the source storage system a response message that indicates that the first data is identical to the second data stored in the destination storage system, the destination storage system can generate or update its mapping database accordingly. For example, the destination storage system can generate or update its mapping database by associating the name of the first data stored at the first location of the source storage system with the local location where the second data is stored in the destination storage system.

One or more examples described herein provide that methods, techniques, and actions performed by a computing device are performed programmatically, or as a computer-implemented method. Programmatically, as used herein, means through the use of code or computer-executable instructions. These instructions can be stored in one or more memory resources of the computing device. A programmatically performed step may or may not be automatic.

One or more examples described herein can be implemented using programmatic modules, engines, or components. A programmatic module, engine, or component can include a program, a sub-routine, a portion of a program, or a software component or a hardware component capable of performing one or more stated tasks or functions. As used herein, a module or component can exist on a hardware component independently of other modules or components. Alternatively, a module or component can be a shared element or process of other modules, programs or machines.

Some examples described herein can generally require the use of computing devices, including processing and memory resources. Examples described herein may be implemented, in whole or in part, on computing devices such as servers, desktop computers, cellular phones or smartphones, personal digital assistants (PDAs), laptop computers, printers, digital picture frames, network equipment (e.g., routers) and tablet devices. Memory, processing, and network resources may all be used in connection with the establishment, use, or performance of any example described herein (including with the performance of any method or with the implementation of any system).

Furthermore, one or more examples described herein may be implemented through the use of instructions that are executable by one or more processors. These instructions may be carried on a computer-readable medium. Machines shown or described with figures below provide examples of processing resources and computer-readable mediums on which instructions for implementing examples can be carried and/or executed. In particular, the numerous machines shown with examples include processor(s) and various forms of memory for holding data and instructions. Examples of computer-readable mediums include permanent memory storage devices, such as hard drives on personal computers or servers. Other examples of computer storage mediums include portable storage units, such as CD or DVD units, flash memory (such as carried on smartphones, multifunctional devices or tablets), and magnetic memory. Computers, terminals, network enabled devices (e.g., mobile devices, such as cell phones) are all examples of machines and devices that utilize processors, memory, and instructions stored on computer-readable mediums. Additionally, examples may be implemented in the form of computer-programs, or a computer usable carrier medium capable of carrying such a program.

System Description

FIG. 1 illustrates an example system to perform data replication using data fingerprints. A destination storage system can use data fingerprints in order to determine whether data that is to be replicated from a source storage system is already stored in the destination storage system. This enables the destination storage system to preserve storage efficiency (e.g., by not storing redundant data) during a replication process even when its mapping database (that maps data stored at source-side data storage locations to data stored at destination-side data storage locations) is inaccurate or corrupt.

According to an example, system 100, such as a destination storage system, can include a replication manage 110, a storage system interface 160, a fingerprint database 150, a mapping database 140, and a data store 170. Depending on implementation, one or more components of system 100 can be implemented on a computing device, such as a server, laptop, PC, etc., or on multiple computing devices that can communicate with a fleet or set of devices over one or more networks. System 100 can also be implemented through other computer systems in alternative architectures (e.g., peer-to-peer networks, etc.). Logic can be implemented with various applications (e.g., software) and/or with firmware or hardware of a computer system that implements system 100.

System 100 can also communicate, over one or more networks via a network interface (e.g., wirelessly or using a wireline), with one or more other storage systems, such as a source storage system 180, using a storage system interface 160. The storage system interface 160 can enable and manage communications between system 100 and the source storage system 180. Data that is to be replicated can also be transmitted between the systems 100, 180 using the storage system interface 160.

In one example, a source storage system 180 can store a plurality of different data using a source-side file system, and system 100 can be used to back up the data of the source storage system 180. In such a case, system 100 can use a mapping database, such as the mapping database 140, to map the source-side storage locations (where data is stored in the source storage system 180) to its destination-side storage locations (where a copy of that data is stored). In this manner, a mirror of the file system of the source storage system 180 can be maintained at system 100 for a long period of time, while being incrementally updated (e.g., by performing a data replication process in response to a user input, or periodically every day, every few days, every week, etc.).

According to some examples, the source storage system 180 and/or system 100 can execute a storage operating system that implements a file layout that supports high volume data storage and that provides a mechanism to enable file systems to access disk blocks. An example of a file layout can be the Write Anywhere File Layout (WAFL) from NetApp Inc., of Sunnyvale, Calif., which enables detecting and sharing of regions of data between two storage systems at a 4 KB block granularity using virtual volume block numbers (VVBN). Another example of a file layout can be NetApp Inc.'s MetaWAFL, which enables detecting and sharing of regions of data between two storage systems that are not exactly one 4 KB block in length or alignment by using variable-length extents (where an extent can be a contiguous area of storage). In the example of FIG. 1, the source storage system 180 and/or system 100 can execute a storage operating system that implements MetaWAFL.

When replicating a file system between the source storage system 180 and system 100, a logical replication model can be used, for example, so that data that is to be transferred from the source storage system 180 to system 100 can be described using a message. As part of a data replication process between the source storage system 180 and system 100, the source storage system 180 can transmit a replication message 181 to system 100. According to an example, the replication message 181 can include an identity 183 of a first file that is to be replicated (e.g., the file's inode number and/or generation number), file information 185 about where the first file is stored in the source storage system, a name 187 of a first data that is used by the first file and stored at a first location of the source storage system, and a fingerprint 189 of the first data. The replication message 181 can indicate to system 100, for example, that “File X, which starts at offset Y and has length Z, uses data named Foo1 that has a fingerprint ABCD,” where File X is the file name 183, offset Y and length Z are the file information 185, Foo1 is the virtual volume block number or an extent where that data is stored at the source (e.g., the data name 187), and ABCD is the fingerprint 189 of the data.
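The four fields of the replication message 181 can be pictured as a small record. The following is a minimal sketch only; the class name, field names, and the numeric offset and length values are illustrative assumptions and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationMessage:
    """Hypothetical shape of replication message 181."""
    file_id: str      # identity 183 (e.g., a file name or inode/generation number)
    offset: int       # file information 185: where the file's data starts
    length: int       # file information 185: how long the region is
    data_name: str    # name 187: source-side VVBN or extent name
    fingerprint: str  # fingerprint 189 of the data

    def describe(self) -> str:
        # Render the message in the sentence form used in the description.
        return (f"File {self.file_id}, which starts at offset {self.offset} "
                f"and has length {self.length}, uses data named "
                f"{self.data_name} that has a fingerprint {self.fingerprint}")


# Illustrative values standing in for "offset Y" and "length Z".
msg = ReplicationMessage("X", 0, 4096, "Foo1", "ABCD")
print(msg.describe())
```

The message carries only references and a fingerprint, not the data itself, which is what allows the destination to decide whether a transfer is needed before any data moves.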

The replication manage 110 can receive the replication message 181 via the storage system interface 160. In one example, the replication manage 110 can include a database check 115, which can access the mapping database 140 of system 100 and determine whether the mapping database 140 is unavailable (e.g., does not exist or has been moved or deleted) or inaccurate (e.g., is corrupt or has an incorrect mapping entry). In some examples, multiple mapping databases 140 can be used by system 100 for backing up multiple storage systems. The database check 115 can send a query 112, for example, to the mapping database 140 to determine the availability and/or accuracy of the mapping database 140 corresponding to the source storage system 180. System 100 can use the mapping database 140 as a translation table that maps data stored at the source storage system 180 to replicated data stored at system 100.

For example, if the mapping database 140 is available and accurate, the database check 115 can use the information from the replication message 181 (e.g., for File X, which uses data named Foo1) and access the mapping database 140 to see if system 100 has already received the data that the source storage system 180 has named Foo1. If the mapping database 140 includes an entry corresponding to Foo1 that shows that the data is stored at a local location (e.g., named Location5) of system 100, system 100 does not need to receive the corresponding data from the source storage system 180. The database check 115 can transmit, for example, a status message 116 to the source storage system 180 that the corresponding data is not needed because system 100 already has it.

On the other hand, if the mapping database 140 indicates that system 100 has not received the data that the source storage system 180 has named Foo1 (e.g., the mapping database 140 does not include an entry for Foo1), the replication manage 110 can ask the source storage system 180 to send the data (e.g., via the status message 116). The replicate component 125 of the replication manage 110 can receive the data 195, select a local storage location in the data store 170 of system 100, and write the data 195 to the local storage location. The database update 130 of the replication manage 110 can use the replication information 128 (such as the source-side data name and the local storage location) to update the mapping database 140 accordingly, so that any future references to the data named Foo1 can be made to the local storage location.
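The two outcomes of a lookup in an available and accurate mapping database can be sketched as follows. This is a hedged illustration: the dictionaries standing in for the mapping database 140 and the data store 170, the function names, the status strings, and the `LocationN` naming scheme are all assumptions made for the example.

```python
def check_mapping(mapping, data_name):
    """Return the status the destination would report for a source-side name."""
    # An existing entry means the data was already received and stored locally.
    return "not_needed" if data_name in mapping else "send_data"


def store_and_map(mapping, data_store, data_name, data):
    """Write received data to a new local location and record the mapping."""
    location = f"Location{len(data_store) + 1}"  # hypothetical naming scheme
    data_store[location] = data
    mapping[data_name] = location  # future references to data_name resolve here
    return location


mapping = {}     # stands in for mapping database 140
data_store = {}  # stands in for data store 170

first_status = check_mapping(mapping, "Foo1")   # no entry yet
location = store_and_map(mapping, data_store, "Foo1", b"image bytes")
second_status = check_mapping(mapping, "Foo1")  # entry now exists
```

After the first transfer, a later replication message that names Foo1 resolves to the stored copy, so the data does not need to be sent again.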

As discussed, in some situations, the mapping database 140 can be unavailable or inaccurate when an operation takes place on either the source storage system 180 or system 100 that alters the mapping of the data between the storage systems. Referring to the example discussed, the database check 115 can receive the replication message 181 (e.g., “File X, which starts at offset Y and has length Z, uses data named Foo1 that has a fingerprint ABCD”), query the mapping database 140, and determine that the mapping database 140 is unavailable or inaccurate. In such a case, the database check 115 can transmit a fail message 114 to the fingerprint check 120 of the replication manage 110. The fail message 114 can indicate that the mapping database 140 cannot be used or is unable to resolve the source-side name Foo1 and cause the replication manage 110 to access or consult another destination-side indexing data structure, such as a fingerprint database 150.

A fingerprint database 150 can store a plurality of entries of fingerprints, where each entry corresponds to data previously received and stored in system 100. Each entry can include a respective fingerprint for the data, the identity of the file that uses the data, and respective information about where that file is stored in the source storage system 180. According to some examples, a fingerprint can correspond to a checksum value or hash sum value of the data. Typically, different data can result in a different checksum value. If a first checksum or fingerprint matches a second checksum or fingerprint, there is a high probability that the data that resulted in the first checksum is the same as the data that resulted in the second checksum. In some examples, the fingerprint database 150 can include duplicate fingerprints.
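A fingerprint database that tolerates duplicate fingerprints can be sketched as a multimap from fingerprint to file references. The description states only that a fingerprint can be a checksum or hash sum; the use of SHA-256 here, as well as the class and method names and the numeric offsets, are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes) -> str:
    # SHA-256 is an assumed choice; any checksum or hash function could serve.
    return hashlib.sha256(data).hexdigest()


class FingerprintDatabase:
    """Hypothetical stand-in for fingerprint database 150."""

    def __init__(self):
        # One fingerprint may map to several entries (duplicates are allowed).
        self._entries = defaultdict(list)

    def add(self, fp, file_id, offset, length):
        self._entries[fp].append((file_id, offset, length))

    def lookup(self, fp):
        # .get avoids creating an empty entry for fingerprints never added.
        return list(self._entries.get(fp, []))


db = FingerprintDatabase()
block = b"shared image"
db.add(fingerprint(block), "File O", 0, len(block))
db.add(fingerprint(block), "File X", 8192, len(block))

matches = db.lookup(fingerprint(block))        # both files reference this data
misses = db.lookup(fingerprint(b"other data"))  # no entry for unseen data
```

A match indicates only that the data is probably the same, which is why the described system follows a match with a confirmation step rather than treating it as proof.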

In response to receiving the fail message 114 (indicating that the mapping database 140 is unavailable or inaccurate), the fingerprint check 120 can use the fingerprint 189 of the first data received from the source storage system 180 (e.g., fingerprint ABCD) and access the fingerprint database 150 to determine whether the fingerprint 189 matches a fingerprint in the fingerprint database 150. Depending on implementation, the fingerprint check 120 can receive the fingerprint 189 of the first data when the replication message 181 is received by the replication manage 110 or receive the fingerprint 189 of the first data with the fail message 114 from the database check 115. The fingerprint 189 of the first data can be a checksum value that has been generated by the source storage system 180 using a checksum function or a checksum algorithm.

The fingerprint check 120 can perform a lookup in the fingerprint database 150 by comparing the fingerprint 189 with the fingerprint entries stored in the fingerprint database 150. If the fingerprint check 120 does not find a matching fingerprint in the fingerprint database 150, the replication manage 110 determines that the corresponding first data has not been received by system 100. The fingerprint check 120 can transmit a status message 122 to the source storage system 180 that the corresponding first data has not been received by system 100. Referring to the example discussed, the replication message 181 that was transmitted as part of a replication process specified that “File X, which starts at offset Y and has length Z, uses data named Foo1 that has a fingerprint ABCD.” Because a matching fingerprint was not found in the fingerprint database 150, the fingerprint check 120 can request the source storage system 180 to transfer File X and data named Foo1 to system 100.

The replicate component 125 can receive the data 195 (e.g., File X and Foo1), select a local storage location in the data store 170 of system 100, and write the data 195 to the local storage location. The database update 130 of the replication manage 110 can use the replication information 128 (such as the source-side data name and the local storage location) to update the mapping database 140 and to also update the fingerprint database 150. The database update 130 can add an entry in the fingerprint database 150 that corresponds to the fingerprint 189 (“ABCD”), the file name 183 (“File X”), and the file information 185 (“offset Y, length Z”).

On the other hand, if the fingerprint check 120 finds a fingerprint that matches the fingerprint 189 of the first data, the replication manage 110 determines that other data (e.g., second data) stored at system 100 has been found that is similar to the first data. The fingerprint check 120 can transmit a request message 124 to the source storage system 180 that asks the source storage system 180 to verify that the second data stored at system 100 is identical to the first data. For example, the request message 124 can specify that while the source storage system 180 asked system 100 to use the data block named Foo1 for File X, which starts at offset Y and has length Z, a similar data block (having the same fingerprint ABCD) is stored at system 100 and is used by a second file, File O, that is stored in the source storage system 180.

In one example, the request message 124 can include (i) the identity of a second file that uses the second data, and (ii) respective information about where the second file is stored in the source storage system 180. For example, the request message 124 can include an identity of a second file (e.g., “File O”) that was previously received from the source storage system 180, with a particular offset P, length Q, and also having the fingerprint ABCD. In this manner, a request message 124 can be used to verify whether or not the first data and the second data are identical without having system 100 transmit the second data itself to the source storage system 180.
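The construction of a request message from a fingerprint match can be sketched as below. The record types, the dictionary used as the fingerprint database, and the choice of the first matching entry are assumptions for illustration; the description does not specify how a candidate is chosen when several entries share a fingerprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    file_id: str
    offset: int
    length: int


@dataclass(frozen=True)
class RequestMessage:
    """Hypothetical shape of request message 124: references only, no data."""
    first: FileRef        # file named in the replication message
    first_data_name: str  # source-side name of the first data
    second: FileRef       # previously replicated file using the matching data
    fingerprint: str


def build_request(first, first_data_name, fp, fingerprint_db):
    candidates = fingerprint_db.get(fp, [])
    if not candidates:
        return None  # no similar data at the destination; nothing to confirm
    # Assumed policy: ask about the first entry with a matching fingerprint.
    return RequestMessage(first, first_data_name, candidates[0], fp)


fingerprint_db = {"ABCD": [FileRef("File O", 100, 50)]}
request = build_request(FileRef("File X", 0, 50), "Foo1", "ABCD", fingerprint_db)
no_request = build_request(FileRef("File X", 0, 50), "Foo1", "ZZZZ", fingerprint_db)
```

Because the request names a file region that the source already knows about, the source can resolve the question locally without receiving any data from the destination.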

The source storage system 180 can receive the request message 124 and investigate its own version of File O, at offset P, length Q to determine which local source-side VVBN or extent is being used to store the data used by File O. If the source storage system 180 determines that File O is using data with the source-side name Foo1, the source storage system 180 can provide a response message 190 to the replication manage 110 that it is sharing data between File X, at offset Y, length Z, and File O, at offset P, length Q. For example, File X and File O can each be a document file that uses an image that is stored at source-side location Foo1. The response message 190 can instruct system 100 to establish an association between File O, at offset P, length Q, and the data named Foo1, and to also establish an association between File X, at offset Y, length Z, and the data named Foo1.
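The source-side check can be sketched as a lookup in the source's own file-to-data references. The `source_refs` table, the dictionary-based message shapes, and the numeric offsets are assumptions for illustration; an actual source would consult its file system metadata.

```python
def verify_request(source_refs, request):
    """Answer whether the second file's region uses the first data's name.

    source_refs maps (file, offset, length) to the source-side data name
    (e.g., a VVBN or extent name). It is a hypothetical stand-in for the
    source file system's own metadata.
    """
    second_ref = (request["second_file"],
                  request["second_offset"],
                  request["second_length"])
    # A missing reference yields None, which never equals a data name.
    identical = source_refs.get(second_ref) == request["first_data_name"]
    return {"identical": identical, "data_name": request["first_data_name"]}


source_refs = {
    ("File O", 100, 50): "Foo1",
    ("File X", 0, 50): "Foo1",   # File X and File O share the same data
    ("File R", 7, 9): "Bar1",
}

shared = verify_request(source_refs, {
    "first_data_name": "Foo1",
    "second_file": "File O", "second_offset": 100, "second_length": 50,
})
not_shared = verify_request(source_refs, {
    "first_data_name": "Foo1",
    "second_file": "File R", "second_offset": 7, "second_length": 9,
})
```

The check compares names rather than bytes: if both file regions resolve to the same source-side name, the data is the same by construction.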

The replication manage 110 can receive the response message 190 indicating that the first data is identical to the second data that is stored at system 100. Because the second data already stored at system 100 is identical to the first data, system 100 does not need to receive another copy of the first data named Foo1. The database update 130 can generate or update the mapping database 140 to include an entry that associates (i) the name of the first data stored at a first location of the source storage system (e.g., “Foo1”), and (ii) the local location where the second data is stored in the destination storage system (e.g., “Location5”). In this manner, the mapping database 140 can be generated and/or updated with accurate and up-to-date information for use with future replication processes.
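Rebuilding a mapping entry from a positive confirmation can be sketched as follows. The `local_refs` table, which records where the destination stored the data for each previously received file region, is an assumption introduced for the example; the description does not state how the destination resolves the second file to its local location.

```python
def apply_response(mapping, local_refs, response, second_ref, first_data_name):
    """Record first_data_name -> local location if the source confirmed a match."""
    if not response["identical"]:
        return False  # data must still be transferred from the source
    # Point the source-side name at the copy the destination already holds.
    mapping[first_data_name] = local_refs[second_ref]
    return True


mapping = {}  # mapping database 140, rebuilt after becoming unusable
local_refs = {("File O", 100, 50): "Location5"}  # hypothetical local index

confirmed = apply_response(mapping, local_refs, {"identical": True},
                           ("File O", 100, 50), "Foo1")
rejected = apply_response({}, local_refs, {"identical": False},
                          ("File O", 100, 50), "Foo1")
```

Each confirmed match restores one entry, so the mapping database can be repopulated incrementally over the course of later replication processes.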

The database update 130 can also update the fingerprint database 150 to include an entry corresponding to the replication message 181. The updated entry can include the fingerprint of the first data (“ABCD”), the identity of the first file that uses the data (“File X”), and the information about where the first file is stored in the source storage system 180 (“offset Y, length Z”). As an addition or an alternative, system 100 can determine if File X is needed, and if File X is not yet stored in the data store 170 of system 100, the replication manage 110 can request File X from the source storage system 180. The replicate component 125 can receive and store the file, and subsequently, the database update 130 can update the mapping database 140 and the fingerprint database 150 with the association information between the source storage system 180 and system 100.

Referring back to the example, the source storage system 180 can receive the request message 124 and investigate its own version of File O, at offset P, length Q to determine which local source-side VVBN or extent is being used to store the data used by File O. If, however, the source storage system 180 determines that the first data named Foo1 is not the data being used by File O (e.g., File O uses a source-side data name different than Foo1), the source storage system 180 can provide a response message 190 to the replication manage 110 that the first data is not identical to the second data. Depending on implementation, the source storage system 180 can transfer, concurrently with the response message 190 or separately, the first file (“File X”) and the first data being used by the first file (“Foo1”). The replicate component 125 can receive and store the first file and the first data, and the database update 130 can update the mapping database 140 and the fingerprint database 150 with the association information between the source storage system 180 and system 100.

Methodology

FIG. 2 illustrates an example method for performing data replication using data fingerprints. A method such as described by an example of FIG. 2 can be implemented using, for example, components described with an example of FIG. 1. Accordingly, references made to elements of FIG. 1 are for purposes of illustrating a suitable element or component for performing a step or sub-step being described. In addition, FIGS. 3A and 3B illustrate example databases used by a destination storage system. The databases, such as described by FIGS. 3A and 3B, can be used by, for example, components described with an example of FIG. 1. References to FIGS. 3A and 3B are made with respect to the example method of FIG. 2.

As an example, a source storage system (“source”) can communicate with a destination storage system (“destination”) for purposes of backing up or replicating data. For purposes of describing the method of FIG. 2, it is assumed that an initial or previous replication process has occurred that involved transferring data references for three different files from the source to the destination. The source may have sent to the destination three replication messages for the three files, as well as the files and the data that the files used. For example, a first replication message can indicate to the destination that for File O, which starts at offset P and has length Q, the source is using data named Foo1 that has a fingerprint ABCD. A second replication message can indicate to the destination that for File R, which starts at offset S and has length T, the source is using data named Bar1 that has a fingerprint EFGH. A third replication message can indicate to the destination that for File U, which starts at offset V and has length W, the source is using data named Bar1 that has a fingerprint EFGH. In this example, two files, File R and File U, are sharing the same region or location of data at the source.

Having received these replication messages, the destination would have generated and/or updated the mapping database with entries that map the source-side data name with the destination-side local location. The destination would have also updated the fingerprint database with three entries. FIG. 3A illustrates an example fingerprint database 300 with three entries 310 corresponding to the three replication messages received by the destination. Each entry can include (i) a fingerprint, (ii) the identity of a file that uses the data with that fingerprint, and (iii) information about the file, such as the offset and length.

Referring to FIG. 2, the destination can receive another message as part of a replication process from the source (210). The message can be a replication message 181 as part of a subsequent replication process between the source and the destination. The replication message 181 can include (i) an identity of a first file, (ii) information about where the first file is stored at the source, (iii) a name of a first data being used by the first file and stored at a first location of the source, and (iv) a fingerprint of the first data. For example, the replication message 181 can indicate to the destination that for File X, which starts at offset Y and has length Z, the source is using data named Foo1 that has a fingerprint ABCD. In response to receiving the replication message 181, the destination can determine whether its mapping database 140 is unavailable or inaccurate (220).

The mapping database 140 maps the source-side name to the destination-side name. If the mapping database 140 is available and accurate, the database check 115 accesses the mapping database 140 to determine whether the destination has already received the data that the source named Foo1 (225). Depending on whether or not an entry exists in the mapping database 140 for Foo1, the destination can communicate with the source to either (i) notify the source that the data the source named Foo1 has already been received, or (ii) request the source to send the data because the data has not been received yet by the destination (227). If the destination does not have the data, the replication manage 110, for example, can request the source to send the data, receive the data from the source, select a destination location to store the data, and update the mapping database 140 and the fingerprint database 150 with up-to-date mapping information and up-to-date fingerprint information, respectively.

Referring back to 220, if, on the other hand, the mapping database 140 is unavailable or inaccurate, the destination can access the fingerprint database 150 to determine whether data stored in the destination has a fingerprint identical to the fingerprint of the first data (e.g., “ABCD”) (230). For example, the database check 115 can transmit a fail message 114 to the fingerprint check 120, which then uses the fingerprint of the first data to search the fingerprint entries in the fingerprint database 150. A fingerprint can correspond to a checksum value or hash sum value of the data. Because different data typically results in different checksum values, if a first checksum or fingerprint matches a second checksum or fingerprint, there is a high probability that the data that resulted in the first checksum is the same as the data that resulted in the second checksum. In this manner, the destination can determine or identify whether any existing data similar to the first data is stored at the destination.

The fingerprint check 120 can determine whether there is a matching fingerprint to the fingerprint of the first data in the fingerprint database 150 (240). If no match is found, the destination can communicate with the source to notify the source that the data has not been received yet by the destination (245). The replication manage 110 can request the source to send the data, receive the data from the source, select a destination location to store the data, and update the mapping database 140 and the fingerprint database 150 with up-to-date mapping information and up-to-date fingerprint information, respectively (247).

However, if a match is found, the fingerprint check 120 can transmit a request message to the source for confirmation (250). For example, referring again to FIG. 3A, which illustrates the fingerprint database 150 of the destination after previous replication process(es), the fingerprint check 120 may perform a lookup of the fingerprint of the first data (e.g., “ABCD”) in the fingerprint database 150 by comparing the fingerprint with the fingerprint entries stored in the fingerprint database 150. The fingerprint check 120 can determine that a previously received File O, which has offset P, length Q, uses data (e.g., second data) similar to the first data because that data also has fingerprint ABCD. Because fingerprints of data are not guaranteed to be perfectly unique (e.g., it is possible for two dissimilar sets of data to generate identical checksum values), the fingerprint check 120 can transmit a request message to the source asking the source to confirm that the data stored at the destination is identical to the first data. The request message can specify that the source instructed the destination to use data named Foo1 for File X, offset Y, length Z, but that similar data (e.g., second data) has been found at the destination used by File O, offset P, length Q. The request message can ask the source to confirm whether the first data and the second data are identical.
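One possible shape for such a request message is sketched below. The class names, field names, and the use of integers for the symbolic offsets and lengths (Y, Z, P, Q) are assumptions for illustration; the description does not prescribe a message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRange:
    """Hypothetical reference to a region of a file: identity, offset, length."""
    file_id: str
    offset: int
    length: int


@dataclass(frozen=True)
class ConfirmationRequest:
    """Hypothetical request asking the source to confirm a fingerprint match."""
    source_name: str      # name the source gave the first data, e.g. "Foo1"
    requested: FileRange  # file range the source asked the destination to fill
    candidate: FileRange  # file range at the destination with the same fingerprint


def build_confirmation_request(
    source_name: str, requested: FileRange, candidate: FileRange
) -> ConfirmationRequest:
    return ConfirmationRequest(source_name, requested, candidate)
```

Carrying both file ranges lets the source check its own records for the candidate file instead of comparing the data itself.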

As an addition or an alternative, the fingerprint database 150 can also include, for each fingerprint entry, the source's snapshot identifier for each file as it is being transferred. This snapshot identifier can be passed back or transmitted from the destination to the source when requesting the source to verify that the second data is identical to the first data. The snapshot identifier provides the source with another checking mechanism for verifying data in order to find the original reference (e.g., File O) at the source. For example, although not shown in FIG. 3A, each entry of the fingerprint database 300 can include (i) a fingerprint, (ii) the identity of a file that uses the data with that fingerprint, (iii) information about the file, such as the offset and length, and (iv) a snapshot identifier for the file.
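An entry carrying the four fields listed above might be modeled as in the following sketch. The type name `SnapshotFingerprintEntry`, the helper `confirmation_hint`, and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, NamedTuple, Optional, Union


class SnapshotFingerprintEntry(NamedTuple):
    """Hypothetical fingerprint entry with an optional source snapshot identifier."""
    fingerprint: str
    file_id: str
    offset: int
    length: int
    snapshot_id: Optional[str] = None


def confirmation_hint(entry: SnapshotFingerprintEntry) -> Dict[str, Union[str, int]]:
    """Collect what the destination passes back to help the source find the file."""
    hint: Dict[str, Union[str, int]] = {
        "file_id": entry.file_id,
        "offset": entry.offset,
        "length": entry.length,
    }
    # The snapshot identifier is included only when the entry recorded one.
    if entry.snapshot_id is not None:
        hint["snapshot_id"] = entry.snapshot_id
    return hint
```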

Referring back to FIG. 2, after the destination sends the request message for confirmation, the source can investigate its file system and provide a response message to the destination indicating whether the first data and the second data are identical or not (260). If the source determines that File O uses a different source-side name than Foo1, then the source has successfully avoided being fooled by a hash collision. In some examples, the source can investigate other possible matches that the destination advises it of. The source can provide a response message to the destination that the first data is not identical to the second data. The destination can receive and store the first file and the first data, and update the mapping database 140 and the fingerprint database 150 with the association information between the source and the destination (270).

However, if the source determines that File O is in fact using data with the source-side name Foo1, the source can provide a response message to the destination that it is sharing data between File X, at offset Y, length Z, and File O, at offset P, length Q. The response message can instruct the destination to establish an association with File O, at offset P, length Q, with the data named Foo1, and to also establish an association with File X, at offset Y, length Z, with the data named Foo1. The destination can then generate or update the mapping database as well as the fingerprint database accordingly (270).
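The source-side check behind both outcomes can be sketched as a lookup of which source-side name the candidate file range actually uses. The table `source_refs`, keyed by (file, offset, length), and the function `confirm_match` are assumptions for illustration; a real source would consult its own file system metadata.

```python
from typing import Dict, Tuple

# Hypothetical key: (file identity, offset, length).
FileRangeKey = Tuple[str, int, int]


def confirm_match(
    source_refs: Dict[FileRangeKey, str],
    source_name: str,
    candidate: FileRangeKey,
) -> bool:
    """Return True only if the candidate range uses the data named source_name.

    A False result covers both a hash collision (the candidate uses a
    different name) and a candidate the source has no record of.
    """
    return source_refs.get(candidate) == source_name
```

Because the comparison is on source-side names rather than fingerprints, two different pieces of data that happen to share a fingerprint are not treated as identical.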

For example, the destination can update the fingerprint database to include entry 360, such as shown in the fingerprint database 350 of FIG. 3B. The entry 360 can correspond to the response message from the source instructing the destination to establish an association with File X, at offset Y, length Z, with the data named Foo1. In addition, the destination can generate or update its mapping database, such as the mapping database 380 as illustrated in FIG. 3C. The mapping database 380 can include an entry 390 that corresponds to the source-side name Foo1 and the destination-side location Location5 (where the corresponding data is stored at the destination). In this manner, the destination can generate or update the mapping database to include up-to-date information so that subsequent replication processes between the source and destination can first use the mapping database to achieve storage efficiency (e.g., by finding data already received and not having to perform unnecessary data transfers).
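The update of both databases after a confirmed match can be sketched as follows. The dictionary layouts, the tuple form of a file range, and the function name `apply_confirmed_match` are assumptions for this sketch; the numeric offsets and lengths are placeholders for the symbolic values Y, Z, P, and Q.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (file identity, offset, length).
FileRangeKey = Tuple[str, int, int]


def apply_confirmed_match(
    mapping_db: Dict[str, str],
    fingerprint_db: Dict[str, List[FileRangeKey]],
    source_name: str,
    fp: str,
    new_file: FileRangeKey,
    location: str,
) -> None:
    """Record a confirmed match without transferring the data again."""
    # Mapping entry (compare entry 390): source-side name -> destination location.
    mapping_db[source_name] = location
    # Fingerprint entry (compare entry 360): the new file range also uses this data.
    entries = fingerprint_db.setdefault(fp, [])
    if new_file not in entries:
        entries.append(new_file)


# Usage with the example names from the description.
mapping_db: Dict[str, str] = {}
fingerprint_db: Dict[str, List[FileRangeKey]] = {"ABCD": [("File O", 30, 20)]}
apply_confirmed_match(mapping_db, fingerprint_db, "Foo1", "ABCD", ("File X", 10, 20), "Location5")
```

After this call, a later replication message naming Foo1 can be resolved through the mapping database alone.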

Hardware Diagram

FIG. 4 is a block diagram that illustrates a computer system upon which examples described herein may be implemented. For example, in the context of FIG. 1, system 100 may be implemented using a computer system such as described by FIG. 4. System 100 may also be implemented using a combination of multiple computer systems as described by FIG. 4.

In one implementation, computer system 400 includes processing resources 410, main memory 420, ROM 430, storage device 440, and communication interface 450. Computer system 400 includes at least one processor 410 for processing information and a main memory 420, such as a random access memory (RAM) or other dynamic storage device, for storing information and instructions to be executed by the processor 410. Main memory 420 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 410. Computer system 400 may also include a read only memory (ROM) 430 or other static storage device for storing static information and instructions for processor 410. A storage device 440, such as a magnetic disk or optical disk, is provided for storing information and instructions. For example, the storage device 440 can correspond to a computer-readable medium that stores data replication instructions 442 that, when executed by processor 410, may cause system 400 to perform operations described below and/or described above with respect to FIGS. 1 and 2 (e.g., operations of system 100 described above).

The communication interface 450 can enable computer system 400 to communicate with one or more networks 480 (e.g., computer network, cellular network, etc.) through use of the network link (wireless or wireline). Using the network link, computer system 400 can communicate with a plurality of systems, such as other data storage systems. In one example, computer system 400 can receive a replication message 452 from a source storage system (not shown) via the network link. When the processor 410 determines that a mapping database of the computer system 400 is unavailable or inaccurate, the processor 410 can access a fingerprint database using a fingerprint of first data received with the replication message 452. The processor 410 can determine whether data stored in the computer system 400 has a fingerprint that is identical to the received fingerprint. If the processor 410 determines that second data stored in the computer system 400 has an identical fingerprint, the processor 410 can transmit to the source storage system, over the network 480, a request message 454 to confirm whether the first data is identical to the second data stored in the computer system 400.

Computer system 400 can also include a display device 460, such as a cathode ray tube (CRT), an LCD monitor, or a television set, for example, for displaying graphics and information to a user. An input mechanism 470, such as a keyboard that includes alphanumeric keys and other keys, can be coupled to computer system 400 for communicating information and command selections to processor 410. Other non-limiting, illustrative examples of input mechanisms 470 include a mouse, a trackball, touch-sensitive screen, or cursor direction keys for communicating direction information and command selections to processor 410 and for controlling cursor movement on display 460.

Examples described herein are related to the use of computer system 400 for implementing the techniques described herein. According to one example, those techniques are performed by computer system 400 in response to processor 410 executing one or more sequences of one or more instructions contained in main memory 420. Such instructions may be read into main memory 420 from another machine-readable medium, such as storage device 440. Execution of the sequences of instructions contained in main memory 420 causes processor 410 to perform the process steps described herein. In alternative implementations, hard-wired circuitry may be used in place of or in combination with software instructions to implement examples described herein. Thus, the examples described are not limited to any specific combination of hardware circuitry and software.

It is contemplated for examples described herein to extend to individual elements and concepts described herein, independently of other concepts, ideas or system, as well as for examples to include combinations of elements recited anywhere in this application. Although examples are described in detail herein with reference to the accompanying drawings, it is to be understood that the concepts are not limited to those precise examples. Accordingly, it is intended that the scope of the concepts be defined by the following claims and their equivalents. Furthermore, it is contemplated that a particular feature described either individually or as part of an example can be combined with other individually described features, or parts of other examples, even if the other features and examples make no mention of the particular feature. Thus, the absence of describing combinations should not preclude having rights to such combinations.

What is being claimed is:
 1. A method comprising: receiving a message comprising an identity of a first file, a file storage location of where the first file is stored in a first storage system, a name of a first data used by the first file, and a fingerprint of the first data, wherein the first data is stored at a first location of the first storage system; determining that a mapping database of a second storage system, which maps storage locations of data in the first storage system to storage locations of replicated data in the second storage system, is unavailable; and accessing a fingerprint database of the second storage system using the fingerprint to determine whether replicated data stored in the second storage system has a second fingerprint identical to the fingerprint of the first data, wherein the fingerprint database stores entries of fingerprints of replicated data stored by the second storage system, wherein an entry of a first replicated data comprises a corresponding fingerprint for the first replicated data, a corresponding identity of a file at the first storage system that uses the first replicated data, and a corresponding file storage location of where the file is stored in the first storage system.
 2. The method of claim 1, wherein the corresponding file storage location, within the entry of the fingerprint database, corresponds to an offset value and a length value.
 3. The method of claim 1, wherein the fingerprint of the first data corresponds to a checksum value generated by the first storage system using a checksum function.
 4. The method of claim 1, comprising: transmitting, to the first storage system, a request message to confirm whether the first data is identical to the replicated data having the second fingerprint.
 5. The method of claim 1, wherein accessing the fingerprint database includes performing a lookup in the fingerprint database by comparing the fingerprint of the first data with the entries of fingerprints stored in the fingerprint database.
 6. The method of claim 4, wherein the request message includes a second identity of a second file that uses the replicated data.
 7. The method of claim 4, comprising: receiving a response message indicating that the first data is identical to the replicated data; and updating the mapping database by associating the name of the first data stored at the first location by the first storage system with a local location where the replicated data is stored by the second storage system.
 8. The method of claim 7, comprising: updating the fingerprint database to include a new entry for the first file, the new entry comprising the fingerprint of the first data, the identity of the first file, and the file storage location of where the first file is stored.
 9. The method of claim 4, further comprising: receiving a response message indicating that the first data is not identical to the replicated data; receiving the first file and the first data; storing the first data within a local location in the second storage system; updating the mapping database by associating the name of the first data stored at the first location of the first storage system with the local location where the first data is stored in the second storage system; and updating the fingerprint database to include a new entry for the first file, the new entry including the fingerprint of the first data, the identity of the first file, and the file storage location of where the first file is stored in the first storage system.
 10. A non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations comprising: receiving a message comprising an identity of a first file, a file storage location of where the first file is stored in a first storage system, a name of a first data used by the first file, and a fingerprint of the first data, wherein the first data is stored at a first location of the first storage system; determining that a mapping database of a second storage system, which maps storage locations of data in the first storage system to storage locations of replicated data in the second storage system, is unavailable; and accessing a fingerprint database of the second storage system using the fingerprint to determine whether replicated data stored in the second storage system has a second fingerprint identical to the fingerprint of the first data, wherein the fingerprint database stores entries of fingerprints of replicated data stored by the second storage system, wherein an entry of a first replicated data comprises a corresponding fingerprint for the first replicated data, a corresponding identity of a file at the first storage system that uses the first replicated data, and a corresponding file storage location of where the file is stored in the first storage system.
 11. The non-transitory computer-readable medium of claim 10, wherein the file storage location corresponds to an offset value and a length value associated with the first storage system.
 12. The non-transitory computer-readable medium of claim 10, wherein the fingerprint of the first data corresponds to a checksum value generated by the first storage system using a checksum function.
 13. The non-transitory computer-readable medium of claim 10, wherein the corresponding file storage location, within the entry of the fingerprint database, comprises an offset value and a length value associated with the first storage system.
 14. The non-transitory computer-readable medium of claim 10, wherein the operations comprise: accessing the fingerprint database by performing a lookup in the fingerprint database by comparing the fingerprint of the first data with the entries of fingerprints stored in the fingerprint database.
 15. The non-transitory computer-readable medium of claim 10, wherein the operations comprise: transmitting, to the first storage system, a request message to confirm whether the first data is identical to the replicated data having the second fingerprint.
 16. The non-transitory computer-readable medium of claim 10, wherein the mapping database comprises a mapping entry mapping a source-side name, used by the first storage system to refer to the file, to a location of where replicated data of the file is stored by the second storage system.
 17. The non-transitory computer-readablemedium of claim 15, wherein the operations comprise: receiving aresponse message indicating that the first data is identical to thereplicated data; updating the fingerprint database to include an entryfor the first file, the entry comprising the fingerprint of the firstdata, the identity of the first file, and the file storage location ofwhere the first file is stored based upon the response message.
 18. The non-transitory computer-readable medium of claim 15, wherein the operations comprise: receiving a response message indicating that the first data is not identical to the replicated data; receiving the first file and the first data; storing the first data within a local location in the second storage system; updating the mapping database by associating the name of the first data stored at the first location of the first storage system with the local location where the first data is stored in the second storage system; and updating the fingerprint database to include a new entry for the first file, the new entry including the fingerprint of the first data, the identity of the first file, and the file storage location of where the first file is stored in the first storage system.
 19. A computing device comprising: a memory resource storing instructions; and at least one processor coupled to the memory resource, the at least one processor executing the instructions to perform operations comprising: receiving a message comprising an identity of a first file, a file storage location of where the first file is stored in a first storage system, a name of a first data used by the first file, and a fingerprint of the first data, wherein the first data is stored at a first location of the first storage system; determining that a mapping database of a second storage system, which maps storage locations of data in the first storage system to storage locations of replicated data in the second storage system, is unavailable; and accessing a fingerprint database of the second storage system using the fingerprint to determine whether replicated data stored in the second storage system has a second fingerprint identical to the fingerprint of the first data, wherein the fingerprint database stores entries of fingerprints of replicated data stored by the second storage system, wherein an entry of a first replicated data comprises a corresponding fingerprint for the first replicated data, a corresponding identity of a file at the first storage system that uses the first replicated data, and a corresponding file storage location of where the file is stored in the first storage system.
 20. The computing device of claim 19, wherein the corresponding file storage location, within the entry of the fingerprint database, corresponds to an offset value and a length value.